Model comparison (done correctly) helps us choose the model that best represents the true data-generating process (DGP) by penalizing models that “overfit”. This penalization is achieved mainly by assessing “fit” not on the training data set but on a held-out test data set.
A complementary way to work against “overfitting” is to specify priors that shrink model coefficients towards zero. Such shrinkage priors are typically normal distributions with a mean of zero and a relatively small standard deviation, where “relatively” refers to the scale on which the predictor is measured.
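One way to see why this shrinks coefficients (a standard equivalence, not specific to the example below): for a regression with residual standard deviation fixed at 1, independent \(\small normal(0,\sigma)\) priors on the coefficients make the MAP estimate a ridge estimate,

\[
\hat\beta_{MAP} = \arg\min_\beta \; \tfrac{1}{2}\sum_i (y_i - X_i\beta)^2 + \frac{1}{2\sigma^2}\sum_j \beta_j^2,
\]

so a smaller prior standard deviation \(\small \sigma\) corresponds to a larger ridge penalty \(\small \lambda = 1/\sigma^2\) and hence stronger shrinkage.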
To visualize how shrinkage works, we generate a small data set for which we want to estimate the mean:
set.seed(5)
library(magrittr)  # provides the pipe %>%
# scale() centers and standardizes, so after adding 1 the sample mean is exactly 1
y = (rnorm(15) %>% scale()) + 1
hist(y)
We want a Bayesian estimate of the mean of y, assuming \(\small y \sim normal(\mu, 1)\), under two different priors \(\small \mu \sim normal(0,\sigma)\) for \(\small \mu\):
par(mar = c(5,3,.5,.1))
curve(dnorm(x,0,1), -10, 10, n = 500,
ylab = "density", xlab = "mu", col = "blue", lwd = 3)
curve(dnorm(x,0,3), n = 500, add = T,
ylab = "density", xlab = "mu", col = adjustcolor("blue",alpha = .5), lwd = 3)
legend("topleft",
lty = 1, lwd = 3,
col = c("blue",adjustcolor("blue",alpha = .5)),
legend = c("mu ~ normal(0,1)","mu ~ normal(0,3)"),
bty = "n")
Here, we calculate the log likelihood and the log prior density for a grid of candidate values of the mean of y:
mus = seq(-1,3,.01)
P =
apply(
data.frame(mu = mus), 1,
function(mu) {
c(
mu = mu,
# log P(y | mu)
llikelihood = sum(dnorm(y, mean = mu, log = T)),
# log P(mu), mu ~ normal(0,1)
Psd1 = dnorm(mu, 0, sd = 1, log = T),
# log P(mu), mu ~ normal(0,3)
Psd3 = dnorm(mu, 0, sd = 3, log = T)
)
}
)
P[,1:5]
## [,1] [,2] [,3] [,4] [,5]
## mu.mu -1.000000 -0.990000 -0.980000 -0.970000 -0.960000
## llikelihood -50.784078 -50.484828 -50.187078 -49.890828 -49.596078
## Psd1.mu -1.418939 -1.408989 -1.399139 -1.389389 -1.379739
## Psd3.mu -2.073106 -2.072001 -2.070906 -2.069823 -2.068751
Next, we calculate the (unnormalized) posterior densities by adding log likelihood and log prior and exponentiating:
post1 = exp(P["llikelihood",] + P["Psd1.mu",])
post3 = exp(P["llikelihood",] + P["Psd3.mu",])
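As a sanity check, we can read the shrinkage off the grid directly. The following sketch repeats the simulation so it runs on its own, and compares the two posterior modes with the sample mean:

```r
set.seed(5)
y <- as.numeric(scale(rnorm(15))) + 1          # same data as above; mean(y) is 1

mus  <- seq(-1, 3, .01)
llik <- sapply(mus, function(m) sum(dnorm(y, mean = m, sd = 1, log = TRUE)))

# unnormalized log posteriors under the two priors
lpost1 <- llik + dnorm(mus, 0, sd = 1, log = TRUE)
lpost3 <- llik + dnorm(mus, 0, sd = 3, log = TRUE)

# the narrower prior pulls the posterior mode further below the sample mean
c(mode_sd1 = mus[which.max(lpost1)],   # 0.94
  mode_sd3 = mus[which.max(lpost3)],   # 0.99
  mean_y   = mean(y))                  # 1
```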
Now we can plot the likelihood, the priors, and the (unnormalized) posteriors together:
par(mfrow = c(3,1), mar=c(2.75,2.75,0,.25), mgp=c(1.25,.1,0), tck=-.01)
plot(mus,exp(P["llikelihood",]),'l', xlab = "", ylab = "P(y|mu)", col = "red", lwd = 2)
curve(dnorm(x,0,1), min(mus), max(mus), n = 500,
ylab = "P(mu)", xlab = "", col = "blue", lwd = 3)
curve(dnorm(x,0,3), n = 500, add = T,
ylab = "density", xlab = "", col = adjustcolor("blue",alpha = .5), lwd = 3)
legend("topright",
lty = 1, lwd = 3,
col = c("blue",adjustcolor("blue",alpha = .5)),
legend = c("mu ~ normal(0,1)","mu ~ normal(0,3)"),
bty = "n")
plot(mus,post1,'l', xlab = "mu", ylab = "P(y|mu) * P(mu)", col = "purple", lwd = 2)
lines(mus,post3,'l', col = adjustcolor("purple",alpha = .5), lwd = 2)
abline(v = c(mus[which.max(post1)],mus[which.max(post3)]), lty = 3,
col = c("purple",adjustcolor("purple",alpha = .5)))
legend("topright",
lty = 1, lwd = 3,
col = c("purple",adjustcolor("purple",alpha = .5)),
legend = c("mu ~ normal(0,1)","mu ~ normal(0,3)"),
bty = "n")
We zoom in to see more clearly how the narrower prior shrinks the estimated \(\small \mu\) towards zero:
par(mar = c(5,3,.5,.1))
ids = which(mus > 0 & mus < 1.5)
plot(mus[ids],post1[ids],'l', xlab = "mu", ylab = "P(y|mu) * P(mu)", col = "purple", lwd = 2)
lines(mus[ids],post3[ids],'l', col = adjustcolor("purple",alpha = .5), lwd = 2)
abline(v = c(mus[which.max(post1)],mus[which.max(post3)]), lty = 3,
col = c("purple",adjustcolor("purple",alpha = .5)))
arrows(x0 = mus[which.max(post3)],
x1 = mus[which.max(post1)],
y0 = mean(c(max(post1),max(post3))),
lwd = 3, length = .15)
legend("topright",
lty = 1, lwd = 3,
col = c("purple",adjustcolor("purple",alpha = .5)),
legend = c("mu ~ normal(0,1)","mu ~ normal(0,3)"),
bty = "n")
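The size of this shift can also be derived in closed form: for \(\small y_i \sim normal(\mu, 1)\) with prior \(\small \mu \sim normal(0, \sigma)\), the normal prior is conjugate and the posterior mean is \(\small n\bar y/(n + 1/\sigma^2)\). A quick check (repeating the simulation so the snippet is self-contained):

```r
set.seed(5)
y <- as.numeric(scale(rnorm(15))) + 1
n <- length(y)

# closed-form posterior mean of mu for y ~ normal(mu, 1), mu ~ normal(0, sigma)
post_mean <- function(sigma) n * mean(y) / (n + 1 / sigma^2)

c(sd1 = post_mean(1), sd3 = post_mean(3))  # 0.9375 and ~0.9926, matching the grid modes
```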
To show that shrinkage works in practice, we estimate spline models with different prior standard deviations on the regression coefficients for the simulated income / well-being data from above.
The following figure shows the estimated relationships for different samples drawn from the population.
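Since the income / well-being data are not regenerated here, the idea can be illustrated on hypothetical toy data (`x`, `y` below are made up for this sketch): a \(\small normal(0,\sigma)\) prior on the spline basis coefficients is a ridge penalty, so the MAP fit has a closed form.

```r
set.seed(1)
# hypothetical toy data standing in for the income / well-being example
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = .3)

# natural spline basis from the base 'splines' package, plus an intercept
X <- cbind(1, splines::ns(x, df = 10))

# MAP estimate under coef ~ normal(0, sigma): ridge solution with lambda = 1/sigma^2
# (for simplicity the intercept is shrunk too)
map_fit <- function(sigma) {
  lambda <- 1 / sigma^2
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

b_narrow <- map_fit(0.1)  # strong shrinkage
b_wide   <- map_fit(10)   # weak shrinkage, close to the unpenalized fit

# coefficients under the narrow prior are pulled towards zero:
sum(b_narrow^2) < sum(b_wide^2)
```

Re-fitting `map_fit()` on several samples drawn with different seeds reproduces the pattern in the figure: the tighter prior yields visibly smoother, more stable curves across samples.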